1 Executive Summary

The SUIF parallelizing compiler research project has made several major contributions to the field of
compiler techniques for high-performance computing. We have developed a large suite of new compiler
techniques that have proven critical to effective parallelization: interprocedural analysis techniques
for finding coarse-grain parallelism, affine program and data transforms that improve memory subsystem
performance, and a program analysis for disambiguating pointer variables, which is a prerequisite to
parallelizing any C program. We have also improved the interaction between the compiler and the operating
system to further enhance memory subsystem performance. All of these techniques have been implemented
in the SUIF parallelizing compiler, which has been demonstrated to parallelize many more programs
effectively than state-of-the-art commercial compilers. We have also developed an interactive
parallelization system, the SUIF Explorer, which guides the user in providing additional information to
improve the parallelization. Finally, we have developed a run-time system that simplifies the programming
of networks of workstations by providing the programmer with the abstraction of a global object space as
well as fault tolerance.

The  compiler  techniques  we  developed  have  been  implemented  in  several  commercial  compilers  such 
as  the  SGI  and  Intel  compilers.  We  have  made  the  SUIF  compiler  infrastructure  publicly  available,  and 
it  has  been  used  by  many  different  institutions  for  research  on  a  large  number  of  different  topics.  These 
topics  include  scalar  optimizations,  parallelization,  memory  optimizations,  interprocedural  analysis,  code 
specialization, partial evaluation, profile-driven optimization, code generation, VLIW machines, vector
machines, multimedia processors, digital signal processors, embedded RAM processors, and reconfigurable
hardware. Universities using the compiler system for research or teaching purposes include Aachen University
of Technology, Colorado State University, Dresden University of Technology, Georgia Institute of
Technology, Harvard University, HE-CNAM, Institut National Polytechnique de Grenoble, MIT, UC Berkeley,
Oregon  Graduate  Institute,  Princeton  University,  Seoul  National  University,  Stanford  University,  University 
of  Adelaide,  UC  Santa  Barbara,  University  of  Cincinnati,  University  of  Delaware,  University  of  Manitoba, 
University  of  Maryland,  University  of  Michigan,  University  of  Minnesota,  University  of  Sussex,  University  of 
Toronto,  University  of  Queensland,  University  of  Wisconsin,  Wayne  State  University,  and  Yale  University. 
Companies  and  research  institutions  that  use  SUIF  include  IRISA/INRIA,  Synopsys  Inc  and  USC/ISI.  The 
SUIF  compiler  has  been  selected  by  DARPA  and  NSF  as  a  basis  of  the  National  Compiler  Infrastructure 
system. 

The project has also produced a large number of Ph.D. graduates and furthered the education of a
number of postdoctoral researchers. The members of the research project and their current positions
are: Saman Amarasinghe, Assistant Professor at MIT; Jennifer Anderson, Researcher at Compaq Western
Research Laboratories; Amer Diwan, Assistant Professor at University of Colorado; Mary Hall, Researcher at
USC/ISI; Martin Rinard, Assistant Professor at MIT; Daniel Scales, Researcher at Compaq Western Research
Laboratories; Chau-Wen Tseng, Assistant Professor at University of Maryland; and Robert Wilson, Member
of Technical Staff at Tensilica Inc.




2  Interprocedural  analysis  for  parallelization 

Existing  commercially  available  parallelizing  compilers  are  not  effective  at  getting  good  performance  on 
multiprocessors.  As  these  parallelizers  were  developed  from  vectorizing  compiler  technology,  they  tend 
to  be  successful  in  parallelizing  only  innermost  loops.  Parallelizing  just  inner  loops  is  not  adequate  for 
multiprocessors  for  two  reasons.  First,  inner  loops  may  not  make  up  a  significant  portion  of  the  sequential 
computation,  thus  limiting  the  parallel  speedup  by  limiting  the  amount  of  parallelism.  Second,  synchronizing 
processors  at  the  end  of  the  inner  loops  leaves  little  computation  occurring  in  parallel  between  synchronization 
points.  The  cost  of  frequent  synchronization  and  load  imbalance  can  potentially  overwhelm  the  benefits  of 
parallelization. 
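
The first limitation can be illustrated with Amdahl's law, which the text does not name but which expresses the same argument: when only a fraction of the sequential run time is parallelized, the remaining serial part bounds the speedup. The sketch below is illustrative only; the function name and the two coverage figures are assumptions, not measurements from this project.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Ideal speedup when only `parallel_fraction` of the run time is parallelized."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


# Hypothetical coverage figures: inner loops covering 50% of the run time
# versus coarse-grain loops covering 95%, both on 8 processors.
for coverage in (0.50, 0.95):
    print(f"coverage {coverage:.0%}: speedup {amdahl_speedup(coverage, 8):.2f}")
```

The model ignores synchronization and load imbalance, so it is an upper bound; the second limitation described above lowers real speedups further.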

We have developed an automatic parallelization system that is fully interprocedural [7, 1, 6]. The system
incorporates all the standard analyses included in today’s automatic parallelizers, such as data dependence
analysis and analyses of scalar variables, including scalar constant propagation, value numbering, induction
variable recognition, scalar privatization, scalar dependence and reduction recognition. In addition, the
system employs analyses for array privatization and array reduction recognition. The implementation of these
techniques extends previous work to meet the demands of parallelizing real programs. The interprocedural
analysis is designed to be practical while providing nearly the same quality of analysis as if the program
were fully inlined. Our system has been shown to be capable of finding parallelism in codes spanning over a
thousand lines and many different procedures.

We have demonstrated that interprocedural array data-flow analysis, array privatization, and reduction
recognition are key technologies that greatly improve the success of automatic parallelization. By finding
coarse-grain parallelism, the compiler increases parallelization coverage, lowers synchronization costs, and
improves speedups. Through our work, we discovered that the effectiveness of an interprocedural
parallelization system depends on the strength of all the individual analyses and on their ability to work
together in an integrated fashion. This comprehensive approach to parallelization analysis is why our system
has been so much more effective at automatic parallelization than previous interprocedural systems and
commercially available compilers.
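
To make array privatization and reduction recognition concrete, the following sketch shows a loop in which every iteration reuses one scratch array and accumulates into one sum. Giving each iteration a private scratch array and each worker a private partial sum removes the dependences between iterations. All names are hypothetical, and the chunks run one after another in reverse order purely to show that their order no longer matters; this illustrates the idea and is not the SUIF implementation.

```python
def run_sequential(rows):
    """Original loop: `scratch` and `total` are shared by all iterations."""
    scratch = [0] * len(rows[0])
    total = 0
    peaks = []
    for row in rows:
        for j, value in enumerate(row):
            scratch[j] = value * value      # every iteration overwrites scratch
        peaks.append(max(scratch))
        total += sum(scratch)               # sum reduction on total
    return peaks, total


def run_chunk(rows):
    """One worker's share after privatization and reduction recognition."""
    partial = 0                             # private partial sum
    peaks = []
    for row in rows:
        scratch = [value * value for value in row]   # private scratch array
        peaks.append(max(scratch))
        partial += sum(scratch)
    return peaks, partial


def run_privatized(rows, workers):
    """Run the chunks in reverse order, then combine the results in index order."""
    size = -(-len(rows) // workers)         # ceiling division
    chunks = [rows[k:k + size] for k in range(0, len(rows), size)]
    results = {}
    for index in reversed(range(len(chunks))):
        results[index] = run_chunk(chunks[index])
    peaks = [p for index in range(len(chunks)) for p in results[index][0]]
    total = sum(results[index][1] for index in range(len(chunks)))
    return peaks, total
```

Because addition is associative, the partial sums can be combined in any order, which is what allows a sum to be treated as a reduction rather than as a loop-carried dependence.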

3  Affine  Program  Transform 

We  have  developed  a  new  affine  framework  and  algorithms  to  improve  parallelism  and  locality [10, 11, 12].  An 
affine  partitioning  framework  unifies  many  useful  program  transforms  such  as  unimodular  transformations 
(interchange,  reversal,  skewing),  loop  fusion,  fission,  scaling,  reindexing,  and  statement  reordering. 
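
As a small illustration of the unimodular transformations mentioned above, the sketch below represents a two-deep loop transformation as an integer matrix applied to dependence distance vectors. A transformation is legal when every transformed dependence stays lexicographically positive, and the inner loop is parallel when every dependence is carried by the outer loop. The stencil, the matrices, and the helper names are assumptions chosen for the example; the full affine partitioning framework is more general than this.

```python
def apply(transform, vector):
    """Multiply an integer matrix by an iteration or dependence vector."""
    return tuple(sum(row[k] * vector[k] for k in range(len(vector))) for row in transform)


def determinant_2x2(t):
    return t[0][0] * t[1][1] - t[0][1] * t[1][0]


def lexicographically_positive(vector):
    for component in vector:
        if component != 0:
            return component > 0
    return False


def is_legal(transform, dependences):
    """Legal if every dependence still points forward in the new iteration order."""
    return all(lexicographically_positive(apply(transform, d)) for d in dependences)


def inner_loop_is_parallel(transform, dependences):
    """True when the outer loop carries every dependence (first component > 0)."""
    return all(apply(transform, d)[0] > 0 for d in dependences)


# Hypothetical stencil a[i][j] = a[i-1][j] + a[i][j-1]: distances (1, 0) and (0, 1).
STENCIL_DEPENDENCES = [(1, 0), (0, 1)]
IDENTITY = [[1, 0], [0, 1]]
INTERCHANGE = [[0, 1], [1, 0]]
WAVEFRONT = [[1, 1], [1, 0]]    # skew by the outer index, then interchange

for name, t in (("identity", IDENTITY), ("interchange", INTERCHANGE), ("wavefront", WAVEFRONT)):
    print(name, is_legal(t, STENCIL_DEPENDENCES), inner_loop_is_parallel(t, STENCIL_DEPENDENCES))
```

Interchange alone is legal here but exposes no inner-loop parallelism; composing it with a skew does, which is why treating these transformations within one framework is useful.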

Based on this framework, we have developed an algorithm that maximizes parallelism while minimizing
communication in programs with arbitrary loop nestings and affine data accesses. Our algorithm can find
the optimal affine partition that maximizes the degree of parallelism with the minimum degree of
synchronization. In addition, it uses a greedy algorithm to minimize communication between loops
heuristically by aligning the computation partitions for different loops, trading off excess degrees of
parallelism, and choosing pipelined parallelism over doall parallelism if doing so significantly reduces the
communication. The algorithm is optimal in maximizing the degrees of parallelism that require (1) no
communication, (2) near-neighbor communication and a constant number of synchronizations, and
(3) near-neighbor communication and O(n) synchronizations, where n is the number of iterations in a loop.

Our algorithm subsumes previous work that uses loop transforms (unimodular transformations, loop fusion,
fission, scaling, reindexing, and statement reordering) as well as previous data and computation
distribution and barrier synchronization elimination algorithms.


4  Automatic  Data  Transform 

Effective  memory  hierarchy  utilization  is  critical  to  the  performance  of  modern  multiprocessor  architectures. 
We  have  developed  the  first  compiler  system  that  fully  automatically  parallelizes  sequential  programs  and 
changes  the  original  array  layouts  to  improve  memory  system  performance^,  3].  Our  optimization  algorithm 
consists  of  two  steps.  The  first  step  chooses  the  parallelization  and  computation  assignment  such  that 
synchronization and data sharing are minimized. The second step then restructures the layout of the data
in  the  shared  address  space  with  an  algorithm  that  is  based  on  a  new  data  transformation  framework.  We 
ran  our  compiler  on  a  set  of  application  programs  and  measured  their  performance  on  the  Stanford  DASH 
multiprocessor.  Our  results  show  that  the  compiler  can  effectively  optimize  parallelism  in  conjunction  with 
memory  subsystem  performance. 

This work is useful for purposes other than translating sequential code to shared memory multiprocessors.
Our algorithm for determining how to parallelize and distribute the computation and data is also useful for
distributed address space machines. Our data transformation framework, consisting of the strip-mining and
permuting primitives, is applicable to layout optimization for uniprocessors. Finally, our data
transformation algorithm can also be applied to HPF programs. While HPF directives were originally intended
for distributed address space machines, our algorithm uses the information to make the data accessed by each
processor contiguous in the shared address space. In this way, the compiler achieves locality of reference
while taking advantage of the cache hardware to provide memory management and coherence functions.
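
The two layout primitives can be sketched on a small example. Assume a row-major array whose columns are divided into equal blocks, one block per processor. Strip-mining splits the column index into a block number and a position within the block, and permuting moves the block number to the outermost position so that each processor's elements become contiguous. The array sizes and function names are hypothetical, and the sketch assumes the number of columns is a multiple of the block size.

```python
def row_major_offset(i, j, n_cols):
    """Original layout: element (i, j) of a row-major array."""
    return i * n_cols + j


def blocked_offset(i, j, n_rows, block):
    """Layout after strip-mining the column index and permuting the dimensions."""
    block_index, within = divmod(j, block)          # strip-mine: j -> (block_index, within)
    # permute (i, block_index, within) -> (block_index, i, within), then linearize
    return (block_index * n_rows + i) * block + within


def offsets_touched(processor, n_rows, block, offset_of):
    """Sorted memory offsets of the column block owned by one processor."""
    columns = range(processor * block, (processor + 1) * block)
    return sorted(offset_of(i, j) for i in range(n_rows) for j in columns)


N_ROWS, N_COLS, BLOCK = 4, 8, 2     # hypothetical sizes: 4 processors, 2 columns each

before = offsets_touched(1, N_ROWS, BLOCK, lambda i, j: row_major_offset(i, j, N_COLS))
after = offsets_touched(1, N_ROWS, BLOCK, lambda i, j: blocked_offset(i, j, N_ROWS, BLOCK))
print("original layout:   ", before)
print("transformed layout:", after)
```

In the original layout the processor's elements are scattered across every row; after the transformation they occupy one contiguous range, which is the property the paragraph above relies on for locality.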


5  Pointer  alias  analysis 

Pointer  analysis  promises  significant  benefits  for  optimizing  and  parallelizing  compilers,  yet  despite  much 
recent  progress  it  has  not  advanced  beyond  the  research  stage.  Several  problems  remain  to  be  solved  before 
it  can  become  a  practical  tool.  First,  the  analysis  must  be  efficient  without  sacrificing  the  accuracy  of  the 
results.  Second,  pointer  analysis  algorithms  must  handle  real  C  programs.  If  an  analysis  only  provides 
correct  results  for  well-behaved  input  programs,  it  will  not  be  widely  used.  We  have  developed  a  pointer 
analysis  algorithm  that  addresses  these  issues. 

We have developed a fully context-sensitive pointer analysis algorithm and have shown that it is very
efficient for a set of C programs [16, 17]. To make context sensitivity feasible, we have developed the
concept of partial transfer functions (PTFs), which minimize re-analysis of a procedure by capturing the
results of the analysis in a parameterized manner for a subset of the domain. The algorithm is based on the
simple intuition that the aliases among the inputs to a procedure are the same in most calling contexts.
Even though it is difficult to summarize the behavior of a procedure for all inputs, we can find partial
transfer functions for the input aliases encountered in the program. This allows us to analyze a procedure
once and reuse the results in many other contexts.
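
The reuse idea can be sketched as a cache of procedure summaries keyed by the alias pattern of the inputs. This is a deliberately simplified model: the real analysis tracks more than which arguments coincide, and every name below (`alias_pattern`, `SummaryCache`, `toy_analyze`) is hypothetical.

```python
def alias_pattern(actuals):
    """Canonical form of which arguments refer to the same location.

    ("a", "b") and ("c", "d") both map to (0, 1); ("x", "x") maps to (0, 0).
    """
    first_position = {}
    return tuple(first_position.setdefault(loc, pos) for pos, loc in enumerate(actuals))


class SummaryCache:
    """Re-analyzes a procedure only when it is called with a new alias pattern."""

    def __init__(self, analyze):
        self._analyze = analyze
        self._summaries = {}
        self.analyses_run = 0

    def summary_for(self, procedure, actuals):
        key = (procedure, alias_pattern(actuals))
        if key not in self._summaries:
            self.analyses_run += 1
            self._summaries[key] = self._analyze(procedure, key[1])
        return self._summaries[key]


def toy_analyze(procedure, pattern):
    """Stand-in for the real analysis; it only reports the pattern it saw."""
    return f"{procedure}: {len(set(pattern))} distinct input location(s)"


cache = SummaryCache(toy_analyze)
for call in (("a", "b"), ("c", "d"), ("x", "x"), ("p", "q")):
    cache.summary_for("swap", call)
print(cache.analyses_run)
```

Four call sites trigger only two analyses here, because three of them share the pattern in which the two arguments are distinct.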

Even  though  our  algorithm  is  still  exponential  in  the  worst  case,  we  have  so  far  found  that  it  performs 
well.  As  long  as  most  procedures  are  always  called  with  the  same  alias  patterns,  our  algorithm  will  continue 
to  avoid  exponential  behavior.  To  be  safe,  after  reaching  some  limit  on  the  number  of  PTFs  per  procedure, 
we  could  easily  generalize  the  PTFs  instead  of  creating  new  ones. 

Our  analysis  can  handle  all  the  features  of  the  C  language.  We  make  conservative  assumptions  where 
necessary  to  ensure  that  our  results  are  safe.  Even  though  we  may  occasionally  lose  some  precision  due  to 
these  conservative  assumptions,  we  believe  it  is  important  to  handle  the  kinds  of  code  found  in  real  programs, 
even  if  they  do  not  strictly  conform  to  the  ANSI  standard. 

6  Interactions  with  operating  systems 

We  have  developed  a  new  technique,  compiler-directed  page  coloring,  that  eliminates  conflict  misses  in 
multiprocessor  applications  [5].  It  enables  applications  to  make  better  use  of  the  increased  aggregate  cache 
size  available  in  a  multiprocessor.  This  technique  uses  the  compiler’s  knowledge  of  the  access  patterns  of 
the  parallelized  applications  to  direct  the  operating  system’s  virtual  memory  page  mapping  strategy.  We 
demonstrate  that  this  technique  can  lead  to  significant  performance  improvements  over  two  commonly  used 
page  mapping  strategies  for  machines  with  either  direct-mapped  or  two-way  set-associative  caches.  We  also 
show  that  it  is  complementary  to  latency-hiding  techniques  such  as  prefetching. 
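
For readers unfamiliar with page coloring, the sketch below shows the arithmetic behind it. In a physically indexed cache, two pages compete for the same cache sets when their physical page numbers are equal modulo the number of colors, which is the cache size divided by the page size times the associativity. The `assign_colors` helper is a hypothetical stand-in for a compiler-generated hint: it simply spreads pages over the colors in the order the compiler expects them to be accessed, and is not the algorithm of [5].

```python
def number_of_colors(cache_bytes, page_bytes, associativity):
    """How many distinct page colors a physically indexed cache has."""
    return cache_bytes // (page_bytes * associativity)


def page_color(physical_page_number, colors):
    """Pages with the same color map to the same cache sets."""
    return physical_page_number % colors


def assign_colors(pages_in_access_order, colors):
    """Hypothetical hint: give pages that are accessed together distinct colors."""
    return {page: position % colors for position, page in enumerate(pages_in_access_order)}


# Assumed machine parameters for illustration: 1 MB cache, 8 KB pages.
print(number_of_colors(1 << 20, 8 << 10, 1))   # direct-mapped
print(number_of_colors(1 << 20, 8 << 10, 2))   # two-way set-associative
print(assign_colors(["A0", "A1", "B0", "B1", "C0"], 4))
```

The operating system would then try to back each virtual page with a physical page of the requested color, so that pages used together do not evict one another.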

We implemented compiler-directed page coloring in the SUIF parallelizing compiler and on two commercial
operating systems. We applied the technique to the SPEC95fp benchmark suite, a representative set
of numeric programs. We used the SimOS machine simulator to analyze the applications and isolate their
performance bottlenecks. We also validated these results on a real machine, an eight-processor 350MHz
Digital AlphaServer. Compiler-directed page coloring leads to significant performance improvements for
several applications. Overall, our technique improves the SPEC95fp rating for eight processors by 8% over
Digital UNIX’s page mapping policy and by 20% over page coloring, a standard page mapping policy. The SUIF
compiler achieves a SPEC95fp ratio of 63.84 on an 8-processor 440MHz AlphaServer, which was the highest
ratio at that time.


7  Evaluation  of  compiler  on  whole  applications 

All the compiler techniques developed have been implemented in the SUIF compiler system. The system
can find coarse-grain parallel loops not previously found by any other automatic system because of its large
collection of advanced interprocedural parallelization analyses. Moreover, it is able to get high absolute
performance out of the parallel code because of its high-level data and loop transformations, which improve
memory subsystem performance. These techniques have a significant impact on the performance of half of
the NAS and SPEC95fp benchmark suites. The SUIF compiler outperforms the state-of-the-art commercial
compiler by 50% in parallelizing the SPEC95fp programs [2, 8].

For  some  programs,  our  analysis  is  sufficient  to  find  the  available  parallelism.  For  other  programs,  it 
seems  impossible  or  unlikely  that  a  purely  static  analysis  could  discover  parallelism — either  because  correct 
parallelization  requires  dynamic  information  not  available  at  compile  time  or  because  it  is  too  difficult  to 
analyze.  In  such  cases,  we  can  benefit  from  some  support  for  run-time  parallelization  or  user  interaction. 
The  aggressive  static  parallelizer  we  have  built  provides  a  good  starting  point  to  investigate  these  techniques. 


8  The  SUIF  Explorer:  An  Interactive  Parallelizer 

Although the interprocedural parallelization analysis can find some coarse-grain loops to parallelize, the
process is fragile: a single dependence in a large, otherwise parallel loop can ruin the program’s
parallel performance. Even a compiler that included every conceivable parallelization technique would
be inadequate, because compilers are fundamentally limited by the sequential semantics originally coded
into the program. Often, application-specific knowledge is required to modify the algorithm to make
it parallelizable. The SUIF Explorer is an interactive parallelization tool that guides the user in supplying
additional information so as to extend the capability of the interprocedural analysis [9].

The SUIF Explorer is more effective than previous systems in minimizing the number of lines of code
that require programmer assistance. First, the interprocedural analyses in the SUIF system are successful
in parallelizing many coarse-grain loops, thus minimizing the number of spurious dependences requiring
attention. We found that the dependences left unresolved by the SUIF parallelizer are mostly nontrivial
and deserve human attention. Second, the system uses dynamic execution analyzers to identify
those important loops that are likely to be parallelizable. This greatly reduces the number of loops that the
programmer needs to pay attention to. Third, the SUIF Explorer is the first to apply program slicing to aid
programmers in interactive parallelization. Slicing reduces the number of lines that need to be analyzed and
minimizes the likelihood of human error. The system guides the programmer in the parallelization process
using a set of sophisticated visualization techniques.
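
Program slicing itself can be sketched as reachability over a dependence graph: the backward slice of a statement is every statement it transitively depends on. The loop body and its dependence table below are hypothetical, and the sketch assumes the dependences have already been computed by some earlier analysis.

```python
def backward_slice(depends_on, criterion):
    """All statements that can affect `criterion`, including itself."""
    seen = set()
    worklist = [criterion]
    while worklist:
        statement = worklist.pop()
        if statement in seen:
            continue
        seen.add(statement)
        worklist.extend(depends_on.get(statement, ()))
    return sorted(seen)


# Hypothetical loop body, keyed by line number:
#   1: t = a[i]
#   2: u = b[i] * 2
#   3: log[i] = u
#   4: s = s + t        (depends on line 1 and on itself across iterations)
DEPENDS_ON = {1: [], 2: [], 3: [2], 4: [1, 4]}

print(backward_slice(DEPENDS_ON, 4))
```

A programmer investigating the loop-carried dependence on `s` only needs to read the lines in the slice of line 4, not the whole loop body.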

Our  experience  with  the  prototype  SUIF  Explorer  system  suggests  that  the  tool  can  be  very  effective 
in  improving  the  parallel  performance  of  large  applications.  We  demonstrate  the  effectiveness  on  three 
programs.  The  MDG  benchmark  improves  from  no  speedup  at  all  to  a  6-times  speedup  on  8  processors, 
Arc3d  improves  from  1.6  to  4.9  and  Hydro  improves  from  2.7  to  4.3. 

9 Shared address memory system for networks of heterogeneous workstations

Distributed memory systems, especially in the form of networks of workstations, are an important
computational resource. However, programming distributed memory machines using commonly available
message-passing libraries is a difficult process. The difficulties become even greater for very sophisticated
scientific applications that have highly irregular parallelism and communication. We have developed a
portable run-time system called SAM, which provides the user with the abstraction of a global name space
with the efficiency of automatic caching of shared data [14]. SAM incorporates mechanisms to address the
problem of high communication overheads on distributed memory machines; these mechanisms include tying
synchronization to data access, chaotic access to data, prefetching of data, and pushing of data to remote
processors.

We  found  that  the  performance  of  our  SAM  applications  depends  fundamentally  on  the  scalability  of  the 
underlying  parallel  algorithm,  and  whether  the  algorithm’s  communication  requirements  can  be  satisfied  by 
the  hardware.  Our  experience  suggests  that  SAM  is  successful  in  allowing  programmers  to  use  distributed 
memory  machines  effectively  with  much  less  programming  effort  than  required  previously. 

The SAM system also provides fault tolerance, which is important for large-scale systems and in
environments where application users do not have absolute control. SAM supports fault tolerance efficiently
by ensuring that data are replicated on more than one workstation using the dynamic caching already provided
by SAM. Our method is efficient because it avoids expensive writes to disk and does not require a common file
server. Each process checkpoints independently, and only when it sends non-reproducible data
to another process [13].